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Abstract 


Techniques  for  the  solution  of  full  systems  of  simultaneous 
equations  represent  an  important  algorithm  class  for  vector  pro- 
cessors. This  report  considers  the  data  flows  involved  in  solving 
a full  equation  system  on  the  Cray-1.  This  involves  study  of  the 
I/O  and  memory- processor  path  traffic  vis-a-vis  the  capabilities 
of  the  Cray-1  to  support  it.  The  I/O  is  found  to  present  problems 
for  small  systems  of  equations  and  in  the  substitution  phases  of 
small  and  large  systems.  Using  an  algorithm  proposed  in  the 
report,  the  memory-processor  path  is  shown  to  have  excess  bandwidth. 
Suggestions  are  made  for  utilicing  this  bandwidth  to  increase  the 
arithmetic  operation  rate  by  modifying  and  expanding  the  pro- 
cessor architecture. 
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I.  Introduction 


The  communication  links  between  the  major  components  of  the 
Cray  1 system  - the  processor,  main  memory,  and  mass  memory  devices 
(backing  store)  - are  the  architectural  features  most  critical  to 
algorithmic  development  after  the  vectorization  feature  itself. 

Only  a single  data  path  exists  between  memory  and  processor,  so 
that  even  though  the  16-way  interleaved  memory  has  capacity  to 
transfer  words  at  four  times  the  clock  speed,  only  h of  this  speed 
is  available  to  the  processor.  This  leaves  potenetially  3/ 4 of 
the  memory  bandwidth  (i.e.,  240  megawords/sec.)  for  I/O.  However, 
the  I/O  channels  have  capacity  for  only  12.9  megawords/sec . when 
operating  with  a full  complement  of  24  discs.  The  potential  for 
a communications  bottleneck  on  both  sides  of  memory  is  therefore 
significant. 

One  can  claim  that  in  fact  a three-level  memory  heirarchy 
exists,  by  viewing  the  vector  register  set  as  functionally  a 
single  or  dual  cache.  However,  the  register  and  functional  unit 
speeds  are  matched,  so  that  no  communication  problem  exists  at 
this  end  of  the  memory  heirarchy. 

Algorithmic  remedies  for  these  communication  problems  are  of 
the  general  philosophy  that  computational  complexity  should  be 
maximized  on  data  at  each  memory  level.  If  the  total  arithmetic 
complexity  is  independent  of  data  organization  in  the  memory 
levels,  this  strategy  insures  that  the  fewest  data  movements  are 
required  when  maximum  computational  use  is  made  of  data  at  each 
level.  Practically,  this  procedure  involves  vector  and  small 
matrix  operations  like  inner-  and  outer-  products  on  data  in  the 
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register  cache  before  sending  the  result  to  main  memory,  and 
large  matrix  operations  on  data  in  main  memory  before  communication 
with  backing  store. 

This  study  will  be  concerned  with  examining  the  data  flows 
associated  with  solution  of  a full  set  of  linear  simultaneous 
equations.  It  happens  that,  in  movement  and  operations  with  large 
data  blocks,  programming  effort  is  increased  to  handle  incomplete 
blocks  and  intrablock  computation  (e.g.,  the  movement  across  the 
pivot  elements  in  sweeping  through  a column  strip  of  a matrix)  . 
However,  the  major  communication  problems  can  be  studied  by 
examining  two  critical  loops  in  an  equation  solution  algorithm: 

(1)  the  outermost  loop,  where  megaword  blocks  are  moved  between 
the  backing  store  and  main  memory  to  satisfy  the  gross  needs  of 
the  functional  units  on  the  other  side  of  memory  for  operands  and 
to  store  results,  and  (2)  the  innermost  mult ipl icat ion- subtract  ion 
loop  in  the  solution,  where  the  memory- to  - register  flow  is  critical. 

Before  proceeding,  it  should  be  noted  that  in  investigating 
methods  to  reduce  the  impact  of  an  apparent  data  flow  bottleneck, 
we  will  not  only  remove  the  flow  constriction  but  have  in  one 
case  by  algorithmic  means  exposed  excess  flow  capacity.  This  in 
turn  suggests  that  additional  functional  unit  capacity  could  be 
accomodated  - either  through  parallelism  or  faster  pipelines. 

If  similar  results  could  be  obtained  for  a sufficient  number  of 
major  applications,  this  would  indicate  that  future  architectural 
redesign  and  expansion  be  directed  toward  the  functional  units 
rather  than  the  data  paths,  heretofore  felt  to  be  the  most  critical 
architectural  feature. 
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II.  Evaluation  of  the  I/O  Bandwidth 

A.  Introduction  | 

It  is  easy  to  show  that  a Cray -1  is  capable  of  solving  a 

large  system  of  n linear  simultaneous  equations  in  approximately 
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4.8n  nanoseconds,  including  a 10  clock  overhead  in  the  innermost 
loop  to  divide  long  vectors  into  ones  of  length  64.  A system 
with  an  8 megabyte  main  memory  can  consequently  solve  1000 
equations  - filling  main  memory  - in  4.8  seconds.  Thus,  an  equation 
set  worthy  of  vector  processing  inevitably  involves  use  of  a 
backing  store . 

Smaller  full  matrices  are  often  encountered  as  part  of  larger 
sparse  matrix  solution;  these  full  components  reside  on  a 
backing  store,  arc  loaded  at  appropriate  Limes  during  the  overall 
solution  process,  and  undergo  a sequence  of  computations  similar 
to  the  factorization  of  a single  full  matrix.  Such  matrices  are 
typically  in  the  order  of  20  < n < 100. 

The  backing  store  itself  can  consist  of  between  1 and  12 
disc  units  on  both  input  and  output,  each  capable  of  a transfer 
rate  of  34.5  megabits-  This  yields  a total  input  capacity  of 
.54-6.48  mcgawords/second.  In  the  following  sections,we  will 
establish  that  this  rate  can  be  inadequate  to  support  the  gross 
computational  needs  of  the  arithmetic  fuctional  units  both  for  small 
and  for  very  large  sets  of  simultaneous  equations,  but  is  adequate 
for  many  typical  matrix  problems  of  intermediate  size. 

B.  Small  matrices  initially  on  backing  store 


To  read  a matrix  of  dimension  n with  m input  channels  from 
backing  store  requires  1.85  n^/m  ps.  Ibis  becomes  equal  to  the 
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solution  time  (4.8n  ns)  when  n=385/m  or  n=32  for  full  input 
channel  capacity.  For  smaller  values  of  n,  the  processor  will  be 
busied  nm/385  of  the  time,  a some  what  alarming  result  for  the 
range  20  < n < 100,  even  with  full  channel  activation. 

C.  Large  full  matrices 


In  the  I.U  triangular  factorization  of  a matrix  too  large  to 
be  contained  in  real  memory,  it  becomes  necessary  to  recall  from 
backing  store  factored  parts  of  the  matrix  to  participate  in 
operations  on  a part  of  the  matrix  currently  being  factored.  For 
the  sake  of  discussion,  it  will  be  assumed  that  the  matrix  is 
partitioned  into  column  strips,  so  that  factorization  takes  place 
on  a strip  currently  in  main  memory  by  recalling  previously- 
factored  strips. 

As  an  example,  in  Figure  1,  a full  matrix  of  size  n (=6)  is 

divided  into  k partitions  of  storage  S^.,  each  having  p = n/k 
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columns.  The  number  of  writes  to  backing  store  is  n , assuming 
that  the  entire  factored  matrix  must  reside  on  backing  store.  Th 


number  of  reads  is 
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Figure  1.  Counting  reads  in  a factorization 


„ V r * p2(k-r)2 

'r  r = l 2 

p(p-l) (k-l)(k)  p2(k-l) (k) (k-1/2) 

= 4 + 3 (1) 


With  total  storage  S and  strip  size  S^.,  then  k = S/S^  and  p = S^/\S. 
The  second  term  of  (1)  easily  dominates  the  expression,  and  this 
dominant  term  can  be  written* 


N s 
r 


-> 

S“ 


3S, 


2.3 
P k 


(2) 


With  half  of  main  memory  devoted  to  the  current  and  the  re- 
called strips,  the  fraction  a = (computation  time)/(l/0  time) 
can  be  computed  from  (2)  as 

a - 3.9  x 1 O^m 
n 


Thus,  a single  disc  could  supply  operands  up  to  n = 3900, 

which  requires  4.7  minutes  of  computation  time.  This  appears  more 

than  adequate  I/O  bandwidth. 


*Equation  (2)  also  applies  to  a large  class  of  sparse  equations  [ 
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D.  The  I/O  problem  in  the  substitution  process 


The  I/O  problem  is  proportionally  more  severe  for  the  forward 
and  back  substitution  steps,  where  only  a few  numeric  computations 
are  performed  with  each  element  of  the  recalled  factored  matrix. 

Consider  an  arithmetic  computation  sequence  where  K arithmetic 
computations  are  performed  on  the  average  on  every  L words  in  main 
memory,  by  a processor  with  an  operation  rate  of  M floating  point 
operations/second  (FLOPS).  Then  the  memory  must  be  supplied 
from  the  backing  store  at  the  rate  of 

ML/K  words /sec.  (3) 

In  the  forward  and  back  substitution  stages,  the  inner  loop 
instruction  will  be  of  the  general  form 

X ( T ) = X(I)  - LU(J)*YD  (4) 

where  LU(J)  contains  the  elements  of  L and  U.  Each  such  element  is 
used  a single  time  in  the  two  substitutions,  so  that  (ignoring 
array  X and  scalar  YD)  L = 1 and  K = 2 in  (3).  If  the  LU  array 
is  on  backing  store,  then  this  store  is  required  to  supply 
operands  for  (4)  at  M/ 2 words/sec. 

This  is  a prohibitive  rate,  as  evidenced  by  calculation  of 
Y = (arithmetic  computation  time)/(I/0  time) 

? (chained  multiply- subtract  time)/ (word  transfer  time) 

5 12. 5xl0'9m/1.85xl0'6 
= . 0068m 

Thus,  the  processor  will  be  busied  less  than  1 percent  of  the 
time  with  a single  disc  control,  and  less  than  10%  with  full 
input  channel  activation. 

This  is  a well-known  problem  in  the  solution  of  systems  that 
cannot  be  contained  in  main  memory;  because  of  the  Cray-1  processor 
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speed,  it  is  simply  aggravated.  It  is  an  interesting  aspect  of 
the  problem  that  for  the  general  class  of  sparse/full  matrices, 
the  above  ratio  is  independent  of  matrix  size,  density,  or  equation 
ordering . 


III.  Evaluation  of  Processor-Memory  Path 
A.  LU  factorization 

The  data  flow  between  the  memory  and  processor  is  associated 
with  the  inner  loops  of  the  triangular  factorization  process. 

As  a result,  it  becomes  necessary  to  consider  notation  and  details 
related  to  the  Eli  factorization. 

Let  the  matrix  A be  factored  into 

A = L U 

where 


One  method  of  performing  this  decomposition  on  the  Cray  1 was 
given  in  [2]  and  will  be  termed  the  inner  product  algorithm.  It 
is  described  as  follows. 

At  the  k pivot  step,  assume  that  the  u„  has  been  computed 
for  1 < i < k-1,  i < j < n,  and  that  have  been  formed  for 

lsjsk-1,  j+isisn. 


r 


Then  we  calculate 


lcop(T) 


loop(T) 


k-1 


l . , = a.,  - T • u . 

ik  lk  lrn  mk 

m=l 


i = k + 1 , . . . n 


(6) 


loop 


P© 


loop(jT) 


k-1 


ukj  = ak 


. - l .. 

3 m=l  m 3 


j = k, . . . n 


(7) 


i . - = t . . / u, 

ik  ik  Kk 


i = k + 1 , . 


(3) 


where  the  loops  are  indicated  in  Figure  2.  The  inner  product 
description  follows  from  loops  and  3,  above . 

The  inner  product  ^ethoH  reuuires  accessing  along  both  rows 
and  columns  and  as  such  it  may  be  necessary  to  skew  the  matrix  in 
memory  so  that  successive  accesses  are  not  made  from  the  same  memory 
bank . 

The  loop  0 portion  of  the  inner  product  inner  loop  was  implemented 
on  the  Cray- 1 by  loading  into  a vector  register  a portion  of  a row, 


u,  , k < m < k + 63.  This  load  was  chained  to  a vector  multiply  in- 

1,  m 


volving  2.,,  , . The  product  was  then  chained  to  a vector  subtract  from 

.c , 1 

the  ore- fetched  matrix  elements  a,  , k < m < k+63.  This  continued  with 

k , m 

ak,m  = ak,m  -*-k,2*U2,m  k * ni  * k + 63  (9) 

unti  1 


lk,m 


ak,in  ' ^'k  , k - 1 *uk  - 1 ,m  k - m * k+63 


(10) 


As  can  be  seen,  the  u elements  that  participated  in  the  vector 
multiply  were  streaming  from  main  memory  into  the  multiply  pipe-line  with 
f^nal  accumulation  through  the  subtract  pipeline.  In  this  way  up  to  64 
inner  products  arc  computed  concurrently.  The  2.  elements  were  computed 
similarly. 
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Figure  2.  k**'  pivot  inner  product  calculation 

In  processing  a matrix  that  is  too  large  to  be  contained  in  main 
memory,  there  will  be  no  time  available  tor  the  I/O  system  to  access 
memory  while  the  inner  loop  is  executing.  As  a result,  the  arith- 
metic and  I/O  operations  cannot  be  concurrent  and  the  factorization 
is  slowed. 

B.  A reduced  data  flow  algorithm 

The  alternative  algorithm  this  report  explores  is  based  on  a 
technique  due  to  Pavkovich  [3].  1'his  algorithm  employs  the  vector 
registers  in  such  a way  so  as  to  reduce  memory  references  to  about 
202  of  what  was  required  in  the  algorithm  described  above.  This 
is  accomplished  by  completing  five  rows  of  the  matrix  for  each 
multiplicative  use  of  a U vector  rather  than  completing  only  a single 
row.  This  should  leave  the  I/O  system  sufficient  memory  access  time 
to  allow  the  movement  of  portions  of  the  matrix  to  and  from  the 
disk  without  slowing  the  functional  units. 
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The  arithmetic  computation  for  this  algorithm  is  numerically 
identical  to  what  was  described  above.  The  sequence  of  operations  is 
changed  with  the  effect  being  a block-wise  reduction  of  the  matrix. 
With  reference  to  Figure  3,  the  following  sequence  is  performed  at 

the  ktn  block  step. 

1)  Reduce  the  q x q diagonal  block  comprising  elements  a.  . , 

k<i<k+q-l,  k<j<k+q-l.  1 * ^ 

2)  Reduce  the  q x p row  blocks.  The  last  row  block  may  have 
fewer  than  p column  elements. 

3)  Reduce  the  p x q column  blocks.  The  last  column  block  may 
have  fewer  than  p row  elements. 

4)  Repeat  steps  1-3  with  the  remaining  k-q  x k-q  sub-matrix. 

5)  The  factorization  will  be  complete  with  the  reduction  of  the 
south-east  corner  diagonal  block.  This  block  may  have 
fewer  than  q row  and  column  elements. 

For  the  present,  only  the  reduction  of  the  q x p rowr  blocks 
and  the  p x q column  blocks  will  be  discussed. 


The  structure  of  the  off-diaeonal  blocks  and  the  matrix 
traversal  scheme  was  chosen  with  the  algorithm's  implementation 
on  the  Cray-1  in  mind.  In  this  way  one  is  always  dealing  with 
full  length  vectors  except  at  a row  or  column  end.  The  detail  of 
a row  block  appears  in  Figure  4. 


-J 


I J ! J 


□ 


1C 


J L 


* ' *L 


X 


q row 
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up  to  p column  elements 


Figure  4.  Row  block  detail 


11 


I 

I 


C.  Row  block  reduction 

This  block  is  read  from  memory  into  t lie  vector  registers  of 
the  Cray-1.  As  a result,  the  limitation  on  p is  64  column  elements, 
this  being  the  number  of  storage  locations  in  a vector  register. 

With  eight  vector  registers  available,  three  of  which  are  used  to 
perform  the  inner  product  calculation,  q has  the  value  of  five. 

The  processing  of  a typical  row  block  proceeds  as  follows. 

First  the  q x p block  to  be  reduced  is  loaded  into  the  q vector 
registers  set  aside  for  that  purpose.  These  vector  registers  will 
be  denoted  B.  to  B . The  inner  loop  operation  sequence  is  carried 
out  for  1 < i < k - 1 (see  Figure  5). 

1)  Load  RV.  into  the  row  vector  register,  ^rv- 

2)  Load  SG^  into  the  scalar  group  registers,  Sj , 1 < j < q. 

3)  With  the  data  now  present  in  registers,  q computation 
sequences  can  be  carried  out.  Each  computation  sequence 
is  of  the  form 


B^  = B 
ill  m 


Sm  x Vrv  ’ 1 5 m - q‘ 


The  computation  represented  by  this  equation  is  performed 
element  by  clement  on  each  of  the  p components  of  the 
vector  register. 

This  completes  all  computation  external  to  the  reduction  block. 

Consequently,  only  the  F^  block  vector  is  completely  reduced,  with 

...  B not  yet  finished.  The  reduction  is  completed  by  using 
l q 

reduced  F vectors  internal  to  the  block  to  reduce  the  balance  of 
the  block. 


i 

I 


I 

: 


i 
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vectors 


The  following  sequence  is  carried  out  for  1 < i < q-1  to  complete 
the  block  reduction. 


1)  Load  SG j,  + i j into  the  scalar 

2)  With  the  data  now  present  in 
sequences  can  be  done.  Each 
the  form 


group  registers,  S^.-i+l  < j < q. 
registers,  q-i  computation 
computation  sequence  is  of 


B = B -S  xB.  ,i+l<m<q. 
m m m i - n 

At  the  i*  completion  of  this  sequence,  Bh+..  is  fully  reduced 
and  can  be  returned  to  memory  and  also  used  to  complete  the  remain- 
ing B \ectors. 


n.  Accounting  for  memory  access  reduction 

Tc  justify  the  claim  that  main  memory  accesses  are  reduced  using 
this  algorithm  it  is  necessary  to  determine  the  ratio  (p)  of 
multiplies  (additions  are  essentially  free)  to  main  memory  accesses. 

Table  1 delineates  the  quantities  involved.  The  quantity  r is 
tiie  number  of  row  vectors  external  to  the  reduction  block.  The 
ratio  is  then 

q/2(q+2r-ljp  arithmetic  operations 

p 

q/2 (q+2r- 1 ) + (r+ 2q) p memory  accesses 


qCq-i) 

Scalar  fetches:  qr  + 

2 

Vector  element  fetches:  (r+qjp 
Vector  clement  stores:  qp 

q(q-i) 

Multiplies  required:  [qr  + ]p 

2 

Table  1.  Ratio  quantities 


t 


The  number  of  multiplies  necessary  is  essentially  a function  of  the 
number  of  scalars  involved,  which  is  primarily  a function  of  r since  q 
is  fixed  in  the  implementation.  Asymptotically,  this  ratio  becomes 


p _ Limp  _ 
a 

r ->  qo  q + p 


operations/access 


Also  note  that  as  p,  which  is  the  vector  register  length,  increases 


= q operations/access 


P -►  CO 


we  asymptotically  can  achieve  no  better  than  q arithmetic  operations 
per  memory  access,  indicating  that  longer  vector  registers  would 
not  improve,  this  aspect  of  the  algorithm. 

Using  values  for  q = 5 and  p = 64  as  in  the  Cray-1  implement- 
ation, the  asymptotic  performance  is  p = 4.63.  As  the  vector 
register  length  increases  the  effect  of  the  q scalar  loads  diminish- 
es and  p asymptote?  to  3 operat ions/ (memory  access'). 

Contrasted  with  q = l and  p=64  used  in  the  original  inner  product 
algorithm  where  p=.9S4  operat i ons/ (memory  access),  the  improved  al- 
gorithm yields  a 78.7"  reduction  in  memory  accesses. 

E.  Algorithm  impl cmentat ion  on  the  Crav-1 

The  data  flow  of  the  inner  loop,  which  performs  the  block 
reduction,  is  quite  straight  forward.  Figure  6 a illustrates  the 
data  flow  through  the  vector  registers  as  the  processing,  external 
to  the  block,  is  carried  out. 

The  physical  vector  registers  in  the  Cray-1  are  assigned  names 
VO  through  V7 . The  contents  of  each  vector  register  has 
been  assigned  a label  with  an  overbar  indicating  what  is  currently 
stored  there,  (i.e.,  Bj  through  B,.  refer  to  the  five  vectors 
comprising  the  reduction  block,  RVh  refers  to  the  ith  row  vector 
residing  in  a vector  register , etc) . 
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In  the  allocation  of  the  vector  registers,  six  (V1-V6)  are  assigned 
to  the  reduction  block,  of  which  five  contain  the  actual  data  of 
the  block  (q=5)  with  a sixth  (labeled  free)  being  used  to  allow 
accumulation  into  the  block  of  B vectors.  One  of  the  remaining 
two  vector  registers  (V7)  is  used  for  holding  the  row  vectors 
RV^,  1 < i < k-1  as  they  are  read  one  at  a time  from  memorv. 

The  other  (VO)  is  used  to  contain  the  product  vector  PV  which 
is  computed  from  RV^  by  a scalar  mu  1 1 ipl ica t ion . 

When  all  external  k-1  row  vectors  are  processed,  the  contents 
of  the  vector  registers  will  be  in  one  of  two  states  as  depicted  in 
Figure  6b.  Which  k-1  end  state  results  is  governed  by  whether  the 
number  of  k-1  row  vectors  processed  is  odd  or  even. 


There  are  two  similar  code  sequences  used  in  processing 
row  vectors: 

1)  accumulation  of  the  B j - B^  block  into  vector  registers  V2-V6; 

2)  accumulation  of  the  block  into  vector  registers  V1-V5.  This 
accumulation  is  depicted  in  Figure  6a  by  " * " symbol.  Initially 
vector  registers  V1-Y5  must  be  loaded  with  the  reduction  block 
from  main  memory,  which  is  depicted  by  the  " M-*-  " symbol.  Then 
the  first  row  vector  RVj  (Figure  5)  must  be  loaded  into  V7  from 
main  memory.  Then  the.  last  scalar  of  scalar  group  one  (SG^)  is 
loaded  and  the  product  of  RY ^ and  the  scalar  is  placed  in  VO 

(PY)  which  is  depicted  by  the  " " symbol.  This  product  is  then 

accumulated  with  BY  (in  Y5)  and  directed  to  Y6 . This  row  vector 
is  then  used  with  the  remaining  four  scalars  working  up  the  scalar 
group.  This  leaves  VI  free  to  propagate  results  in  the  reverse 
direction  with  the  next  row  vector. 

What  follows  is  the  symbolic  layout  of  vector  instructions 
that  reflect  the  operation  sequence  of  the  external  row  block 
processing  (refer  to  Figures  5 and  6a).  Only  the  pertinent 
instructions  have  been  included.  Scalar  code  necessary  for 
address  generation  and  loop  control  have  been  omitted  and  the 
braces  indicate  chained  instructions. 
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The  column  block  reduction  uses  a similar  scheme  with  the 
only  alteration  being  where  in  memory  the  matrix  data  is  coming 


from . 

At  this  stage  of  the  algorithm  all  the  external  row  vectors 
have  been  processed  with  the  contents  of  the  vector  registers  in 
one  of  the  two  k-1  states  depicted  in  Figure  6b.  As  discussed 
earlier,  the  rest  of  the  row  vectors  come  from  inside  the  reduction 
block  to  complete  the  blocks  reduction.  In  fact,  is  now  com- 

pletely reduced  and  will  be  employed  as  the  next  row  vector  to 
further  reduce  > B,  TTj  and  B . Figures  7a  and  b show  the  internal 
block  reduction  data  flow  starting  from  either  of  the  k-1  ending  states 
of  the  external  block  reduction.  This  internal  block  reduction  is 
referred  to  as  draining  the  block. 
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At  each  drain  state  the  completion  of  a B vector 
is  indicated  by  a circle.  In  the  subsequent  states,  as  the 
rest  of  the  B vectors  are  completed  they  are  success ivel y re- 
turned to  memory  ("-►M  symbolizes  the  store  of  the  vector  register 
to  memory).  The  odd  and  even  k-I  states  give  rise  to  two  block 
drain  codes. 

What  follows  is  the  symbolic  layout  of  vector  instructions 
necessary  for  the  draining  of  a row  block  starting  from  the  odd 
k-1  state.  This  data  flow  is  depicted  in  Figure  7a.  Also  refer 
to  Figure  5.  The  draining  starting  from  the  even  k-1  state  is 
similar  in  spirt  as  is  the  draining  of  a column  reduction  block. 
Again,  scalar  code  for  address  generation  has  been  omitted  for 
the  sake  of  clarity  and  the  braces  indicate  chained  or  concurrent 
instruction  sequences. 
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Once  all  the  row  vectors  comprising  the  block  have  been  returned 
to  memory  the  block  reduction  is  complete.  Depending  on  whether 
row  or  column  pivoting  is  desired  an  optional  pivot  multiplication 
can  be  inserted  before  the  vector  stores  in  the  appropriate  drain 
rout ine. 

Since  there  is  only  one  path  to  the  Cray-1  main  memory  from 
the  computational  section  the  scalar  fetches  should  be  placed  in 
such  a way  so  as  to  avoid  conflicting  with  the  block  vector  stores 
and  thus  causing  a halt  to  instruction  issuing. 

It  is  easy  to  see  that,  when  a row  vector  is  read  from  main 
memory  each  element  is  employed  in  five  multiplies  giving  rise 
to  the  asymptotic  limit  of  p - 5 as  p •+  °°. 
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IV. 


Comment  a ry 


A.  Further  algorithmic  considerations 

The  problem  of  pivoting  has  been  ignored  in  this  report, 
although  it  involves  data  restructuring  and  hence  can  influence 
data  flow.  It  would  be  considered  when  the  more  general  problem 
of  data  formatting  was  being  investigated.  Here,  one  would  be 
concerned  with  the  formatting  of  A,  L,  and  H in  the  disc,  main 
memory,  and  vector  registers. 

Since  the  disc  is  read  and  written  sequentially  it  must  share 
at  least  a common  block  format  with  main  memory.  However,  the 
data  may  be  reformatted  between  memory  and  the  registers,  since 
a matrix  block  in  memory  may  be  accessed  both  row-  and  column- 
wise. Also,  A and  L U need  not  be  identically  formatted;  the 
first  may  be  determined  by  user  convenience  whereas  L and  U are 
internal  to  the  equation  solution  algorithm.  The  result  may  be 
that  the  alternate  row-  and  column-strip  access  of  Figure  3 may 
be  inconvenient;  this  would  not  seriously  impact  the  interior 
loop  where  the  reduced  memory  accesses  are  achieved. 

B.  Speedup  by  architectural  modifications 

1 . Introduction 

With  the  processor -memory  path  busied  between  1/4  and  1/5  of 
the  time  in  support  of  the  inner  loops,  the  memory  is  free  to  commun- 
icate with  the  instruction  parcel  buffers,  the  B and  T registers, 


1 


arid  (most  importantly)  the  backing  store  the  majority  of  the  time. 
We  have  shown  that  lack  of  disc  capacity  can  impede  computation 
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overall;  however  even  if  solution  time  becomes  I/O  dominated 
so  that  I/O  traffic  is  maximized,  it  can  be  readily  shown  that 
memory  will  be  occupied  at  most  m/144  of  the  time  emptying  I/O 
buffers.  Adding  this  to  the  above  fractions,  it  seems  safe  to 
assume  that  1/2  - 2/3  of  the  memory  bandwidth  is  unused  for  all 
purposes  presently  accounted  for. 

In  the  following  sections,  we  suggest  architectural  modifica- 
tions to  utilize  this  bandwidth  to  speedup  the  equation  solution. 

2.  Register  control 

Presently,  access  to  elements  of  a vector  register  are  con- 
trolled by  the  register  with  a single  counter.  If  the  access 
control  were  moved  from  the  vector  register  to  the  functional 
unit,  the  same  vector  register  could  be  concurrently  accessed 
by  multiple  functional  units.  This  would  permit  self  accumulation 
of  a vector  register  into  itself,  allowing,  for  example,  incre- 
mentation as  VI  «-  VI  + C.  The  register  propagation  scheme  of 
the  proposed  algorithm  where  one  vector  register  was  added  to 
another  could  be  avoided,  allowing  a q of  6 and  a further  reduc- 
tion in  memory  accesses. 

3.  Expanded  functional  units 

A more  important  consequence  of  excess  memory  bandwidth  and 
modified  register  control  would  be  the  ability  to  add  functional 
units  for  increased  parallelism.  If  each  multiply-add  unit 
required  the  same  fraction  of  memory  bandwidth  as  determined 
above , possibly  two  or  three  units  could  be  added  without  inter- 
ference. One  possibility  for  accomplishing  this  without  changing 
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■■m 


instruction  formats  and  thereby  making  present  codes  upward 
compatible  is  now  described. 

The  data  flow  is  symbolically  diagrammed  in  Figure  8 with  the 
vector  instruction  sequence  that  produced  it. 


Issued  Issued 
First  Second 


Issued 

Third 


S1 

S2 

s'| 

kv(0) 

...  M-J 

PV(O) 

five  l ) 

S0-  — ■ 

rv(  i ) 

KV(2) 

_ 

' ; [ 

-*i  1 

PVf  ( 2 ) 

— 

- KV 

Broadcast  Multiply 

■ PV 



! co) 


1l 


Instruction  Sequence 
PV-  S , * iTV 


These  3 instruction 
I'WS  *kT)  issue  in  3 successive 
FvVs^Rvj  clock  Periods. 


J2(0)  J 

k 5,(1)  ^ 


IV1.)! , 

J 11,(2)11 
!— i--J  1 


~L 


1T.(0) 


I 1 


Self  Accumulation 


This  issues  at  the  chain  slot  arrival  of  data  from  the  S,  multiply. 
E^B^-PY  Tiiis  issues  at  the  next  chain  slot  from  the  multiply. 

Issues  at  Sj  multiply  chain  slot. 


Figure  8.  Symbolic  data  flow  of  parallel  inner  product. 


Essentially,  three  inner  products  are  computed  in  parallel  with 
a clock  period  displacement  in  the  elements  being  processed.  The 
first  multiply  issued  would  be  one  clock  period  ahead  of  the  second 
one  issued.  The  second  in  turn  tvould  he  one  clock  period  ahead 
of  the  third.  The  first  product  to  arrive  at  PV(0)  (The  notation 
PV(0)  refers  to  element  zero  for  vector  register  PV)  would  signal 
chain  slot  time  for  the  waiting  Bj  accumulation  instruction.  This 


instruction  would  issue  with  the  product  being  fed  into  the  adder. 

In  the  next  clock  period  PV(1)  would  receive  the  first  multiply's 
second  product  which  would  move  on  to  B^'s  adder,  and  so  forth. 

Also,  in  this  clock  period,  PY(0)  w'ould  receive  the  second  multi- 
ply’s  first  product  thereby  signaling  chain  slot  time  for  the  now 
waiting  B?  accumulation  instruction,  causing  it  to  issue.  In 
the  next  clock  period  the  third  multiply's  first  product  will 
arrive  at  PV(0)  initiating  the  accumulation.  In  succeeding 
clock  periods,  three  products  will  simultaneously  arrive  at  PV 
and  then  be  passed  on  to  the  three  adders  for  B vector  accumulation. 

This  scheme  is  not  without  its  drawbacks.  The  ordering  and 
placement  of  vector  instructions  has  to  be  critically  observed 
noting  all  instruction  issue  delays  to  insure  correct  chain  slot 
time  synchronisation.  When  using  a vector  register  (PV)  in  this 
multi-chain  through  mode,  missing  a chain  slot  time  would  produce 
an  entirely  different  calculation,  furthermore,  external  intcrupts 
(I/O,  timer,  etc.)  would  have  to  be  disallowed  during  a multi-store 
use  of  any  vector  register,  thereby  anticipating  the  invocation  of 
multi-chain  through  mode  by  subsequent,  as  yet  unseen,  vector 
instructions . 

4.  Scalar  register  chaining 

When  intermediate  results  of  a chained  computation  arc  not 
required,  it  is  undesireable  to  allow  them  to  occupy  a vector  reg- 
ister. One  technique  might  be  to  allow  chaining  through  scalar 
registers.  When  a vector  result  is  passed  to  a chain  through 
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modifying  and  expanding  the  processor  architecture. 


scalar  register,  it  would  simply  move  on  to  the  next  functional 
unit  of  the  chained  instruction  sequence.  The  following  hypo- 
thetical instruction  sequence  is  an  example  of  using  a scalar 
register  in  this  manner. 

SI*-  S2*RV  Begin  production  of  product  element 

V6*-  V5  + S1  Accumulate  these  products 
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